Introduction

This report explores the relationship between several demographic and economic attributes and the average walkability of regions in Florida. The data come from the National Walkability Index dataset published on the EPA’s website (U.S. Environmental Protection Agency, 2021). The dataset was filtered to districts in the state of Florida, and those districts were grouped by CBSA (core-based statistical area). Summary statistics were then computed for each CBSA region, so the new dataset includes the average National Walkability Index score (AvgNWI), the average proportion of the population that is working age (AvgP_Wrk), the average district population (AvgDisPop), the total population of the region (TotCPop), and the average proportion of low-wage workers (AvgP_LowW) across all districts in each CBSA region. Each attribute captures a different aspect of walkability, demographics, or economic conditions in these regions. I predict that regions with a larger average working-age share and a smaller average low-wage share will have higher average walkability scores.

My original plan was to make three plots: an interactive scatter plot, a choropleth map, and a heatmap. In the end, I made all three and added a coefficients plot. The first figure is an interactive scatter plot of AvgP_Wrk vs AvgP_LowW, with points color-coded by AvgNWI. The second figure is a choropleth map that shows AvgNWI across the CBSA regions of Florida. The third is a coefficients plot from a multiple linear regression model predicting AvgNWI, which I added because I wanted to see how each variable relates to AvgNWI. The last figure is a heatmap showing the correlations between AvgNWI and the other selected attributes.

Load Packages and Files

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.3
library(terra)
## Warning: package 'terra' was built under R version 4.3.3
## terra 1.7.78
## 
## Attaching package: 'terra'
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(htmlwidgets)
## Warning: package 'htmlwidgets' was built under R version 4.3.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(broom)


walkability <- read_csv("../data/EPA_SmartLocationDatabase_V3_Jan_2021_Final.csv")
## Rows: 220740 Columns: 117
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): CSA_Name, CBSA_Name
## dbl (115): OBJECTID, GEOID10, GEOID20, STATEFP, COUNTYFP, TRACTCE, BLKGRPCE,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdb_path <- "../data/Natl_WI.gdb"
layers <- st_layers(gdb_path)
layer_name <- layers$name[1]
nwi <- st_read(gdb_path, layer = layer_name)
## Reading layer `NationalWalkabilityIndex' from data source 
##   `C:\Users\Jackie\Downloads\dataviz_mini-project_02\dataviz_mini-project_02\dataviz_mini-project_02\data\Natl_WI.gdb' 
##   using driver `OpenFileGDB'
## Simple feature collection with 220739 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -10434580 ymin: -83867.97 xmax: 3407868 ymax: 6755033
## Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic_USGS_version

Summarize the data

flwalkability <- walkability %>%
  filter(STATEFP == '12') %>%
  group_by(CBSA_Name) %>%
  summarize(AvgNWI = mean(NatWalkInd, na.rm = TRUE),
            AvgP_Wrk = mean(P_WrkAge, na.rm = TRUE),
            AvgDisPop = mean(TotPop, na.rm = TRUE),
            TotCPop = sum(TotPop, na.rm = TRUE),
            AvgP_LowW = mean(R_PCTLOWWAGE, na.rm = TRUE)
            )
flwalkability
## # A tibble: 30 × 6
##    CBSA_Name                         AvgNWI AvgP_Wrk AvgDisPop TotCPop AvgP_LowW
##    <chr>                              <dbl>    <dbl>     <dbl>   <dbl>     <dbl>
##  1 Arcadia, FL                         6.01    0.559     1400.   36399     0.263
##  2 Cape Coral-Fort Myers, FL           9.42    0.515     1398.  718679     0.249
##  3 Clewiston, FL                       6.17    0.552     1605.   40127     0.255
##  4 Crestview-Fort Walton Beach-Dest…   7.26    0.589     1646.  266595     0.258
##  5 Deltona-Daytona Beach-Ormond Bea…  10.1     0.554     1862.  634773     0.270
##  6 Gainesville, FL                     9.11    0.624     1628.  320724     0.253
##  7 Homosassa Springs, FL               6.87    0.481     1626.  143087     0.267
##  8 Jacksonville, FL                   10.2     0.601     2093. 1475386     0.245
##  9 Key West, FL                       10.9     0.601     1004.   76325     0.235
## 10 Lake City, FL                       6.49    0.570     1728.   69105     0.271
## # ℹ 20 more rows
# Join the summarized data to the map for Florida
florida_nwi <- nwi[nwi$STATEFP == '12', ]
florida_nwi <- florida_nwi[!is.na(florida_nwi$NatWalkInd), ]
users_map <- florida_nwi %>%
  left_join(flwalkability, by = "CBSA_Name")
users_map
## Simple feature collection with 11442 features and 34 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 796752.4 ymin: 259071.7 xmax: 1612207 ymax: 961154.4
## Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic_USGS_version
## First 10 features:
##         GEOID10      GEOID20 STATEFP COUNTYFP TRACTCE BLKGRPCE CSA
## 1  121170221051 121170221051      12      117  022105        1 422
## 2  120710104093 120710104093      12      071  010409        3 163
## 3  120710104104 120710104104      12      071  010410        4 163
## 4  120710104103 120710104103      12      071  010410        3 163
## 5  120860003071 120860003071      12      086  000307        1 370
## 6  120950147031 120950147031      12      095  014703        1 422
## 7  120330036144 120330036144      12      033  003614        4 426
## 8  120339900000 120339900000      12      033  990000        0 426
## 9  120710104041 120710104041      12      071  010404        1 163
## 10 120710104042 120710104042      12      071  010404        2 163
##                                    CSA_Name  CBSA
## 1              Orlando-Lakeland-Deltona, FL 36740
## 2          Cape Coral-Fort Myers-Naples, FL 15980
## 3          Cape Coral-Fort Myers-Naples, FL 15980
## 4          Cape Coral-Fort Myers-Naples, FL 15980
## 5  Miami-Port St. Lucie-Fort Lauderdale, FL 33100
## 6              Orlando-Lakeland-Deltona, FL 36740
## 7               Pensacola-Ferry Pass, FL-AL 37860
## 8               Pensacola-Ferry Pass, FL-AL 37860
## 9          Cape Coral-Fort Myers-Naples, FL 15980
## 10         Cape Coral-Fort Myers-Naples, FL 15980
##                                  CBSA_Name    Ac_Total    Ac_Water    Ac_Land
## 1            Orlando-Kissimmee-Sanford, FL   377.49665     0.00000  377.49665
## 2                Cape Coral-Fort Myers, FL   691.21604    25.29563  665.92042
## 3                Cape Coral-Fort Myers, FL   625.08763    53.62688  571.46076
## 4                Cape Coral-Fort Myers, FL  1610.68231   191.16595 1419.51636
## 5  Miami-Fort Lauderdale-Pompano Beach, FL    99.85187     0.00000   99.85187
## 6            Orlando-Kissimmee-Sanford, FL   423.49975    18.13615  405.36360
## 7           Pensacola-Ferry Pass-Brent, FL  3826.38748    15.06668 3811.32081
## 8           Pensacola-Ferry Pass-Brent, FL 75522.48436 75522.48436    0.00000
## 9                Cape Coral-Fort Myers, FL   816.05813    70.99751  745.06061
## 10               Cape Coral-Fort Myers, FL   804.52992    56.96297  747.56695
##       Ac_Unpr TotPop CountHU   HH Workers D2B_E8MIXA D2A_EPHHM        D3B
## 1   377.49665   1571     747  643    1012  0.6101164 0.3411523  91.587780
## 2   665.92042   1880     877  688     876  0.5984436 0.4112783  40.060523
## 3   571.46076   1875     807  807     901  0.5047176 0.5293845  69.835907
## 4  1419.51636   4061    1971 1582    1632  0.5229428 0.2998326  46.905440
## 5    99.85187   1658     384  364     619  0.2788802 0.1725376 126.068339
## 6   405.36360   2491    1155  982    1395  0.5964563 0.5454343  97.397991
## 7  3807.37991   2032     688  524     563  0.7117882 0.7462556   6.047835
## 8     0.00000      0       0    0       0  0.0000000 0.0000000   0.000000
## 9   745.06061   2809    1151  905    1224  0.6974998 0.5202777  53.853551
## 10  747.56695   2297    1073  938    1297  0.5069232 0.2965802  77.940310
##          D4A D2A_Ranked D2B_Ranked D3B_Ranked D4A_Ranked NatWalkInd
## 1  -99999.00          6         12         13          1   7.666667
## 2  -99999.00          8         12          8          1   6.333333
## 3  -99999.00         11          8         11          1   7.166667
## 4     807.35          5          9          9         14  10.000000
## 5     355.40          2          3         16         17  11.833333
## 6     584.73         12         11         14         15  13.500000
## 7  -99999.00         17         16          4          1   7.166667
## 8  -99999.00          1          1          1          1   1.000000
## 9  -99999.00         11         15         10          1   8.000000
## 10 -99999.00          5          8         12          1   6.500000
##    Shape_Length Shape_Area    AvgNWI  AvgP_Wrk AvgDisPop TotCPop AvgP_LowW
## 1      5858.584    1527709 10.190048 0.6075743  2937.963 2450261 0.2433612
## 2      6683.597    2797313  9.420233 0.5153093  1398.208  718679 0.2486237
## 3      6436.426    2529690  9.420233 0.5153093  1398.208  718679 0.2486237
## 4     10926.322    6518339  9.420233 0.5153093  1398.208  718679 0.2486237
## 5      2543.113     404094 12.707212 0.5918418  1775.130 6070944 0.2214451
## 6      5224.096    1713882 10.190048 0.6075743  2937.963 2450261 0.2433612
## 7     18751.923   15485177  9.039653 0.5996245  1791.688  481964 0.2577461
## 8    128518.435  305636842  9.039653 0.5996245  1791.688  481964 0.2577461
## 9      8180.276    3302543  9.420233 0.5153093  1398.208  718679 0.2486237
## 10     7837.987    3255888  9.420233 0.5153093  1398.208  718679 0.2486237
##                             Shape
## 1  MULTIPOLYGON (((1433116 731...
## 2  MULTIPOLYGON (((1398559 499...
## 3  MULTIPOLYGON (((1398823 497...
## 4  MULTIPOLYGON (((1396073 493...
## 5  MULTIPOLYGON (((1589877 448...
## 6  MULTIPOLYGON (((1419643 714...
## 7  MULTIPOLYGON (((825795.4 87...
## 8  MULTIPOLYGON (((813914 8334...
## 9  MULTIPOLYGON (((1399432 501...
## 10 MULTIPOLYGON (((1400114 499...

Figure 1 - Interactive Scatter Plot

# Create the base ggplot
my_plot <- ggplot(
  data = flwalkability,
  mapping = aes(x = AvgP_Wrk, y = AvgP_LowW, color = AvgNWI)) +
  geom_point(aes(text = paste(
    "CBSA Name: ", CBSA_Name, "<br>",
    "Average District Population: ", AvgDisPop
  )), size = 4) +
  scale_color_viridis_c() +
  labs(
    title = "Average Portion of the Population that is Working Age vs Low Wage",
    x = "Average Portion of the Population that is Working Age",
    y = "Average Portion of Workers that are Low Wage",
    color = "AvgNWI"
  ) +
  theme_minimal()
## Warning in geom_point(aes(text = paste("CBSA Name: ", CBSA_Name, "<br>", :
## Ignoring unknown aesthetics: text
# Convert the ggplot to an interactive plotly plot
interactive_plot <- ggplotly(my_plot, tooltip = "text")

interactive_plot

The original plan for the plot shown above was to make an interactive plot of the relationship between the average working-age share (AvgP_Wrk) and the average low-wage share (AvgP_LowW) across CBSA regions in Florida, with points color-coded by the average National Walkability Index (AvgNWI). I wanted the CBSA name and the average district population to appear when I hover over a point. The plot was made interactive with plotly. This was the easiest plot to create, and I did not encounter any difficulties. To explore the data further, I could add a trend line to highlight patterns in the scatter plot, or show more information in the hover text for each point.
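The trend line mentioned above could be sketched roughly as follows. This is a minimal sketch and not part of the original analysis: `toy_regions` is a hypothetical stand-in for flwalkability with made-up values, and a linear fit through `geom_smooth(method = "lm")` is only one possible choice of trend.

```r
library(ggplot2)

# Hypothetical stand-in for flwalkability; the values are made up for illustration.
toy_regions <- data.frame(
  AvgP_Wrk  = c(0.48, 0.52, 0.55, 0.57, 0.60, 0.62),
  AvgP_LowW = c(0.27, 0.26, 0.26, 0.25, 0.24, 0.25),
  AvgNWI    = c(6.5, 7.0, 8.2, 9.1, 10.3, 9.4)
)

trend_plot <- ggplot(toy_regions, aes(x = AvgP_Wrk, y = AvgP_LowW)) +
  geom_point(aes(color = AvgNWI), size = 4) +
  # Linear trend with a confidence band. Color is mapped only inside
  # geom_point(), so a single line is fit across all regions.
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "gray40") +
  scale_color_viridis_c() +
  theme_minimal()

trend_plot
```

Mapping color inside `geom_point()` rather than in the top-level `aes()` keeps the trend line from being tied to the continuous AvgNWI scale.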

This plot allows us to explore how characteristics like the working-age share and the low-wage share relate to each other across different areas. It also tells the story of how these characteristics relate to National Walkability Index scores. From the graph, it appears that the low-wage share is negatively correlated with walkability and the working-age share is positively correlated with walkability. This information could inform policies related to employment and urban planning. One data visualization principle I applied is keeping the color scale for AvgNWI the same as in the second plot. Additionally, the graph is kept minimal, and the points are sized so that they are easy to hover over. The hover text for each point is short and easy to understand.

Figure 2 - Spatial Plot

ggplot(data = users_map) +
  geom_sf(aes(fill = AvgNWI), linewidth = 0.1, color = "gray80") +  # linewidth sets polygon border width in ggplot2 >= 3.4
  scale_fill_viridis_c(option = "viridis", name = "Walkability Index") +
  labs(title = "Average National Walkability Index in CBSA Regions of Florida",
       caption = "Gray areas on the map represent areas without walkability data") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
    axis.text = element_blank(),
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.caption = element_text(hjust = 1)
  ) +
  guides(
    fill = guide_colorbar(
      title.position = "top",
      title.hjust = 0.5,
      title.vjust = 1,
      title.theme = element_text(size = 8) 
    ))

The original chart planned for this figure was a choropleth map displaying the average National Walkability Index (AvgNWI) across CBSA regions in Florida. For this chart, the summarized walkability data was joined to the geographic data for the state of Florida by CBSA name. I tested a variety of colors to see which would best suit the areas with missing data, and decided on light gray because it was the least distracting. Choosing the color scheme was the main difficulty I encountered when making this graph. I also initially had some difficulty merging the two datasets. One thing I could add to the graph is a label for the most walkable area.
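One way to catch merge problems like the one described above is to check which rows would go unmatched before joining. The following is a minimal sketch under stated assumptions: `toy_map` and `toy_summary` are hypothetical stand-ins for the map data and the per-CBSA summary, and "Region X (hypothetical)" is an invented name used only to produce an unmatched row.

```r
library(dplyr)

# Hypothetical stand-ins: toy_map plays the role of the district-level map data,
# toy_summary the role of the per-CBSA summary table.
toy_map <- tibble(
  GEOID10   = c("a", "b", "c", "d"),
  CBSA_Name = c("Gainesville, FL", "Gainesville, FL", "Key West, FL",
                "Region X (hypothetical)")
)
toy_summary <- tibble(
  CBSA_Name = c("Gainesville, FL", "Key West, FL"),
  AvgNWI    = c(9.11, 10.9)
)

# Map rows with no matching summary row; these would get NA after the join.
unmatched <- anti_join(toy_map, toy_summary, by = "CBSA_Name")
unmatched

# The join itself keeps every map row and fills AvgNWI where a match exists.
joined <- left_join(toy_map, toy_summary, by = "CBSA_Name")
joined
```

If `unmatched` is empty, every district picks up a summary value. Otherwise the listed rows show which key values differ between the two tables.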

This map tells the story of how walkable different regions of Florida are on average. Policymakers could use it to identify regions of Florida with high walkability scores and then look into the urban planning policies in those regions. The graph applies several data visualization principles, including a consistent color gradient to represent the AvgNWI values. Additionally, the design is kept minimal so that the focus stays on the data in the map.

Figure 3 - Coefficients Plot

# Load necessary libraries
library(broom)

# Fit multiple linear regression model
model <- lm(AvgNWI ~ AvgP_Wrk + AvgDisPop + TotCPop + AvgP_LowW, data = flwalkability)

# Use broom::tidy to extract coefficients and their confidence intervals
coefficients <- tidy(model, conf.int = TRUE) %>%
  filter(term != "(Intercept)")  # Remove intercept from plotting

# Plot coefficients with confidence intervals
ggplot(coefficients, aes(x = estimate, y = fct_rev(term))) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, color = "violet") +
  labs(
    title = "Coefficients of Multiple Linear Regression Model",
    x = "Coefficient",
    y = ""
  ) +
  theme_minimal()

This plot displays the coefficients, with confidence intervals, from a multiple linear regression model predicting AvgNWI from AvgP_Wrk, AvgDisPop, TotCPop, and AvgP_LowW. To make it, I fit a multiple linear regression model (named model) with AvgNWI as the dependent variable and the other variables as predictors, then extracted the coefficients and their confidence intervals. The main difficulty was deciding which kind of model-based plot to create. One additional piece of information I could add to the plot is the p-value of each coefficient. I was motivated to create this plot because a multiple linear regression model is a useful way to estimate how each predictor relates to AvgNWI while holding the others constant. The fitted model can also be used to predict AvgNWI for new areas or scenarios.
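The p-values mentioned above are already part of what `tidy()` returns for an `lm` fit, in a column named `p.value`. The sketch below illustrates this on the built-in `mtcars` dataset, which is used here only as a stand-in for flwalkability. The names `toy_model`, `toy_coefs`, and `p_label` are hypothetical.

```r
library(broom)

# mtcars is a built-in dataset used here only as a stand-in for flwalkability.
toy_model <- lm(mpg ~ wt + hp, data = mtcars)

# tidy() returns one row per term, including a p.value column.
toy_coefs <- tidy(toy_model, conf.int = TRUE)
toy_coefs <- toy_coefs[toy_coefs$term != "(Intercept)", ]

# Build a short text label per term that could be drawn next to each point,
# for example with geom_text(aes(label = p_label)).
toy_coefs$p_label <- paste0("p = ", signif(toy_coefs$p.value, 2))
toy_coefs[, c("term", "estimate", "p.value", "p_label")]
```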

This plot tells the story of how strongly each predictor variable is associated with the average National Walkability Index value (AvgNWI). By displaying the coefficients, it provides insight into which factors relate most strongly to the walkability of an area. The plot shows that AvgP_Wrk and AvgP_LowW have the largest coefficients, while the coefficients for the other two variables sit at essentially zero. Part of that gap reflects units: AvgP_Wrk and AvgP_LowW are proportions between 0 and 1, while AvgDisPop and TotCPop are counts of people, so a one-unit change means something very different for each. One data visualization principle applied in this graph is minimalism. Additionally, I used color to draw the viewer’s attention to the zero-coefficient line.
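Because raw regression coefficients depend on the units of each predictor, one common way to make them comparable is to standardize the predictors before fitting. The sketch below is not part of the original analysis. It uses the built-in `mtcars` dataset as a stand-in for flwalkability, and the names `raw_model`, `scaled_cars`, and `scaled_model` are hypothetical.

```r
# mtcars is a built-in dataset used here only as a stand-in for flwalkability.
raw_model <- lm(mpg ~ wt + hp, data = mtcars)

# Standardize each predictor to mean 0 and standard deviation 1.
scaled_cars <- transform(
  mtcars,
  wt = as.numeric(scale(wt)),
  hp = as.numeric(scale(hp))
)
scaled_model <- lm(mpg ~ wt + hp, data = scaled_cars)

# Raw coefficients are per unit of each predictor;
# scaled coefficients are per one standard deviation of each predictor.
raw_coefs    <- coef(raw_model)[c("wt", "hp")]
scaled_coefs <- coef(scaled_model)[c("wt", "hp")]
rbind(raw = raw_coefs, scaled = scaled_coefs)
```

Each scaled coefficient equals the raw coefficient multiplied by that predictor's standard deviation, so a predictor measured in large units can have a tiny raw coefficient and still matter on the standardized scale.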

Figure 4 - Heat Map

# Calculate the correlation matrix, convert it to a long format for ggplot, and add rounded correlation values
cor_matrix <- cor(flwalkability %>% select(-CBSA_Name), use = "complete.obs")
cor_matrix_long <- as.data.frame(as.table(cor_matrix))
cor_matrix_long$nice_cor <- round(cor_matrix_long$Freq, 2)

# Plot the heatmap
heatmap_plot <- ggplot(cor_matrix_long, aes(x = Var2, y = Var1, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = nice_cor), color = "black", size = 4) +
  scale_fill_gradient2(
    low = "#F63719",
    mid = "white",
    high = "#3B9AB2",
    limits = c(-1, 1)
  ) +
  labs(x = NULL, y = NULL,
       title = "Heat Map of the Correlation Matrix") +
  coord_equal() +
  theme_minimal() +
  theme(panel.grid = element_blank())

heatmap_plot

The original figure planned here was a heatmap displaying the correlation matrix among AvgNWI, AvgP_Wrk, AvgDisPop, TotCPop, and AvgP_LowW. For this heatmap, the correlation matrix was first computed, then converted into a long format for ggplot, with the values rounded for the tile labels. The plot visualizes the strength and direction of the correlations among the variables and helps identify potential relationships and dependencies. The main difficulty I encountered with this plot was deciding whether or not to include it. In the end, I kept it because it adds to the previous graphs.
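To make the long-format conversion concrete, the sketch below runs the same `as.data.frame(as.table(...))` step on a small made-up data frame. The data frame `toy` and its columns `a`, `b`, and `c` are hypothetical and chosen so the correlations are easy to verify by eye.

```r
# Toy data; the values are made up for illustration.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),  # rises with a: correlation of 1
  c = c(5, 4, 3, 2, 1)    # falls as a rises: correlation of -1
)

# 3 x 3 correlation matrix, one row and one column per variable.
toy_cor <- cor(toy)

# Long format: one row per (Var1, Var2) pair, with the correlation in Freq.
toy_long <- as.data.frame(as.table(toy_cor))
toy_long
```

The column is named `Freq` because `as.table()` treats the matrix as a table of counts. That is why the heatmap code maps `fill = Freq`.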

This plot tells the story of how strongly the variables analyzed in this report are correlated with each other. One data visualization principle applied in this graph is a diverging color scheme, which separates positive from negative values. Additionally, I used a minimal theme and labeled only the parts that needed labels.

Conclusions

The findings and visualizations generally support my initial prediction about the relationships between walkability and the predictor variables. Higher walkability tends to correlate with a higher working-age population share and a slightly lower proportion of low-wage workers. This suggests that more walkable areas might attract a more economically active and potentially higher-earning population, although these correlations alone cannot establish cause. More walkable areas also tend to have larger populations.

References

U.S. Environmental Protection Agency. (2021). National Walkability Index. Data.gov. Retrieved June 15, 2024, from https://catalog.data.gov/dataset/walkability-index1/

U.S. Environmental Protection Agency. (2021). Smart location mapping. Retrieved June 15, 2024, from https://www.epa.gov/smartgrowth/smart-location-mapping

Healy, K. (2019). Data visualization: A practical introduction. Retrieved from https://socviz.co/refineplots.html#refineplots